Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning
Authors
Abstract
Since most real-world applications of classification learning involve continuous-valued attributes, properly addressing the discretization process is an important problem. This paper addresses the use of the entropy minimization heuristic for discretizing the range of a continuous-valued attribute into multiple intervals. We briefly present theoretical evidence for the appropriateness of this heuristic for use in the binary discretization algorithm used in ID3, C4, CART, and other learning algorithms. The results serve to justify extending the algorithm to derive multiple intervals. We formally derive a criterion based on the minimum description length principle for deciding the partitioning of intervals. We demonstrate via empirical evaluation on several real-world data sets that better decision trees are obtained using the new multi-interval algorithm.

Introduction

Classification learning algorithms typically use heuristics to guide their search through the large space of possible relations between combinations of attribute values and classes. One such heuristic uses the notion of selecting attributes that locally minimize the information entropy of the classes in a data set (cf. the ID3 algorithm (13) and its extensions, e.g. GID3 (2), GID3* (5), and C4 (15), as well as CART (1), CN2 (3), and others). See (11; 5; 6) for a general discussion of the attribute selection problem.

The attributes in a learning problem may be nominal (categorical), or they may be continuous (numerical). The term "continuous" is used in the literature to refer to attributes taking on numerical values (integer or real), or in general to an attribute with a linearly ordered range of values. The above-mentioned attribute selection process assumes that all attributes are nominal. Continuous-valued attributes are therefore discretized prior to selection, typically by partitioning the range of the attribute into subranges. In general, a discretization is simply a logical condition, in terms of one or more attributes, that serves to partition the data into at least two subsets. In this paper, we focus only on the discretization of continuous-valued attributes.

We first present a result about the information entropy minimization heuristic for binary discretization (two-interval splits). This gives us:

- a better understanding of the heuristic and its behavior,
- formal evidence that supports the usage of the heuristic in this context, and
- a gain in computational efficiency that speeds up the evaluation process for continuous-valued attribute discretization.

We then proceed to extend the algorithm to divide the range of a continuous-valued attribute into multiple intervals rather than just two. We first motivate the need for such a capability, then present the multiple-interval generalization, and finally present the empirical evaluation results confirming that the new capability does indeed produce better decision trees.

Binary Discretization

A continuous-valued attribute is typically discretized during decision tree generation by partitioning its range into two intervals. A threshold value for the continuous-valued attribute is determined, and the test ...
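The minimum-description-length criterion mentioned in the abstract can be stated precisely. The following is the decision inequality as derived in the body of the paper, where S is a set of N examples, T is a candidate cut point for attribute A, S1 and S2 are the two subsets induced by the cut, k, k1, and k2 count the classes present in S, S1, and S2, and Ent denotes the class information entropy:

```latex
% MDLP stopping criterion: a binary cut point T on a set S of N examples
% is accepted iff the information gain clears a description-length penalty.
\[
  \mathrm{Gain}(A,T;S) \;>\; \frac{\log_2(N-1)}{N} \;+\; \frac{\Delta(A,T;S)}{N},
\]
\[
  \Delta(A,T;S) \;=\; \log_2\!\left(3^{k}-2\right)
  \;-\; \bigl[\,k\,\mathrm{Ent}(S) - k_1\,\mathrm{Ent}(S_1) - k_2\,\mathrm{Ent}(S_2)\,\bigr].
\]
```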
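To make the procedure concrete, here is a minimal Python sketch, an illustration rather than the authors' implementation, of the two pieces the paper combines: the entropy-minimizing binary cut applied recursively, with the MDLP rule above deciding whether each cut is kept. The names (`entropy`, `mdlp_accepts`, `discretize`) are ours, and for brevity the sketch evaluates every cut between distinct adjacent values instead of only the boundary points that the paper's efficiency result would permit.

```python
import math
from collections import Counter

def entropy(labels):
    """Class information entropy: Ent(S) = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdlp_accepts(labels, left, right):
    """MDLP stopping rule: accept the cut iff the information gain
    exceeds (log2(N-1) + Delta) / N, per the criterion stated above."""
    n = len(labels)
    ent_s, ent_l, ent_r = entropy(labels), entropy(left), entropy(right)
    gain = ent_s - (len(left) / n) * ent_l - (len(right) / n) * ent_r
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * ent_l - k2 * ent_r)
    return gain > (math.log2(n - 1) + delta) / n

def discretize(values, labels):
    """Recursive entropy-based multi-interval discretization.
    Returns the accepted cut points (midpoints between adjacent values)."""
    pairs = sorted(zip(values, labels))
    vals = [v for v, _ in pairs]
    labs = [c for _, c in pairs]
    cuts = []

    def best_cut(lo, hi):
        # Keep the cut between distinct adjacent values that minimizes the
        # weighted class entropy of the two induced subsets. (The paper shows
        # only "boundary points" need be checked; this sketch skips that
        # optimization.)
        best, n = None, hi - lo
        for i in range(lo + 1, hi):
            if vals[i] == vals[i - 1]:
                continue
            left, right = labs[lo:i], labs[i:hi]
            e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
            if best is None or e < best[0]:
                best = (e, i)
        return best

    def recurse(lo, hi):
        if hi - lo < 2:
            return
        found = best_cut(lo, hi)
        if found is None:
            return
        _, i = found
        if mdlp_accepts(labs[lo:hi], labs[lo:i], labs[i:hi]):
            cuts.append((vals[i - 1] + vals[i]) / 2)
            recurse(lo, i)  # keep splitting each side until MDLP rejects
            recurse(i, hi)

    recurse(0, len(labs))
    return sorted(cuts)

# Toy usage: class 'a' below the gap, class 'b' above; one cut is accepted,
# and the pure subintervals are rejected by the MDLP rule.
values = list(range(1, 16)) + list(range(20, 35))
labels = ['a'] * 15 + ['b'] * 15
print(discretize(values, labels))  # -> [17.5]
```

Note how the criterion yields multiple intervals for free: each accepted cut recursively exposes two subranges, and the MDLP test supplies the stopping condition that a plain two-interval split lacks.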
Similar Resources
Discretization of Continuous-valued Attributes and Instance-based Learning
Recent work on discretization of continuous-valued attributes in learning decision trees has produced some positive results. This paper adopts the idea of discretization of continuous-valued attributes and applies it to instance-based learning (Aha, 1990; Aha, Kibler & Albert, 1991). Our experiments have shown that instance-based learning (IBL) usually performs well in continuous-valued attribu...
A New Algorithm for Optimization of Fuzzy Decision Tree in Data Mining
Decision-tree algorithms provide one of the most popular methodologies for symbolic knowledge acquisition. The resulting knowledge, a symbolic decision tree along with a simple inference mechanism, has been praised for comprehensibility. The most comprehensible decision trees have been designed for perfect symbolic data. Classical crisp decision trees (DT) are widely applied to classification t...
Value Difference Metrics for Continuously Valued Attributes
Nearest neighbor and instance-based learning techniques typically handle continuous and linear input values well, but often do not handle symbolic input attributes appropriately. The Value Difference Metric (VDM) was designed to find reasonable distance values between symbolic attribute values, but it largely ignores continuous attributes, using discretization to map continuous values into symb...
Multi-interval Discretization of Continuous Attributes for Label Ranking
Label Ranking (LR) problems, such as predicting rankings of financial analysts, are becoming increasingly important in data mining. While there has been a significant amount of work on the development of learning algorithms for LR in recent years, preprocessing methods for LR are still very scarce. However, some methods, like Naive Bayes for LR and APRIORI-LR, cannot deal with real-valued data ...
Solving robot selection problem by a new interval-valued hesitant fuzzy multi-attributes group decision method
Selecting the most suitable robot, given the wide range of robot specifications and capabilities, is an important issue when performing hazardous and repetitive jobs. Companies should consider powerful group decision-making (GDM) methods to evaluate candidate robots against the selected attributes (criteria). In this study, a new GDM method is proposed by utilizi...
Journal:
Volume/Issue:
Pages: -
Published: 1993